Search CORE

202 research outputs found

RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning

Author: Gao Xin
Kim Ji-Sung
Rzhetsky Andrey
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 27/04/2018
Field of study

Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes; race and ethnicity are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest). RIDDLE yielded significantly better classification performance across all metrics that were considered: accuracy, cross-entropy loss (error), and area under the curve for receiver operating characteristic plots (all

p < 10^{-6}

). We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize medical features which are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation which predispose to diseases

arXiv.org e-Print Archive

Directory of Open Access Journals

Representation of research hypotheses

Author: Rzhetsky Andrey
Soldatova Larisa N
Publication venue
Publication date: 01/01/2011
Field of study

BACKGROUND: Hypotheses are now being automatically produced on an industrial scale by computers in biology, e.g. the annotation of a genome is essentially a large set of hypotheses generated by sequence similarity programs; and robot scientists enable the full automation of a scientific investigation, including generation and testing of research hypotheses. RESULTS: This paper proposes a logically defined way for recording automatically generated hypotheses in machine amenable way. The proposed formalism allows the description of complete hypotheses sets as specified input and output for scientific investigations. The formalism supports the decomposition of research hypotheses into more specialised hypotheses if that is required by an application. Hypotheses are represented in an operational way – it is possible to design an experiment to test them. The explicit formal description of research hypotheses promotes the explicit formal description of the results and conclusions of an investigation. The paper also proposes a framework for automated hypotheses generation. We demonstrate how the key components of the proposed framework are implemented in the Robot Scientist “Adam”. CONCLUSIONS: A formal representation of automatically generated research hypotheses can help to improve the way humans produce, record, and validate research hypotheses. AVAILABILITY: http://www.aber.ac.uk/en/cs/research/cb/projects/robotscientist/results

Aberystwyth Research Portal

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Brunel University Research Archive

Imitating Manual Curation of Text-Mined Facts in Biomedicine

Author: Andrey Rzhetsky
Andrey Rzhetsky
Ivan Iossifov
Raul Rodriguez-esteban
Raul Rodriguez-esteban
See Profile
See Profile
Publication venue: Public Library of Science
Publication date: 01/01/2005
Field of study

Text-mining algorithms make mistakes in extracting facts from natural-language texts. In biomedical applications, which rely on use of text-mined data, it is critical to assess the quality (the probability that the message is correctly extracted) of individual facts—to resolve data conflicts and inconsistencies. Using a large set of almost 100,000 manually produced evaluations (most facts were independently reviewed more than once, producing independent evaluations), we implemented and tested a collection of algorithms that mimic human evaluation of facts provided by an automated information-extraction system. The performance of our best automated classifiers closely approached that of our human evaluators (ROC score close to 0.95). Our hypothesis is that, were we to use a larger number of human experts to evaluate any given sentence, we could implement an artificial-intelligence curator that would perform the classification job at least as accurately as an average individual human evaluator. We illustrated our analysis by visualizing the predicted accuracy of the text-mined relations involving the term cocaine

CiteSeerX

Public Library of Science (PLOS)

Crossref

Cold Spring Harbor Laboratory Institutional Repository

Directory of Open Access Journals

PubMed Central

Self-Correcting Maps of Molecular Pathways

Author: Rzhetsky Andrey
Weinreb Chani
Zheng Tian
Publication venue: Public Library of Science
Publication date: 01/01/2006
Field of study

Reliable and comprehensive maps of molecular pathways are indispensable for guiding complex biomedical experiments. Such maps are typically assembled from myriads of disparate research reports and are replete with inconsistencies due to variations in experimental conditions and/or errors. It is often an intractable task to manually verify internal consistency over a large collection of experimental statements. To automate large-scale reconciliation efforts, we propose a random-arcs-and-nodes model where both nodes (tissue-specific states of biological molecules) and arcs (interactions between them) are represented with random variables. We show how to obtain a non-contradictory model of a molecular network by computing the joint distribution for arc and node variables, and then apply our methodology to a realistic network, generating a set of experimentally testable hypotheses. This network, derived from an automated analysis of over 3,000 full-text research articles, includes genes that have been hypothetically linked to four neurological disorders: Alzheimer's disease, autism, bipolar disorder, and schizophrenia. We estimated that approximately 10% of the published molecular interactions are logically incompatible. Our approach can be directly applied to an array of diverse problems including those encountered in molecular biology, ecology, economics, politics, and sociology

CiteSeerX

Public Library of Science (PLOS)

Directory of Open Access Journals

PubMed Central

Detecting Coevolution in and among Protein Domains

Author: Andrey Rzhetsky
Chen-Hsiang Yeang
David Haussler
Publication venue: Public Library of Science
Publication date: 01/01/2007
Field of study

Correlated changes of nucleic or amino acids have provided strong information about the structures and interactions of molecules. Despite the rich literature in coevolutionary sequence analysis, previous methods often have to trade off between generality, simplicity, phylogenetic information, and specific knowledge about interactions. Furthermore, despite the evidence of coevolution in selected protein families, a comprehensive screening of coevolution among all protein domains is still lacking. We propose an augmented continuous-time Markov process model for sequence coevolution. The model can handle different types of interactions, incorporate phylogenetic information and sequence substitution, has only one extra free parameter, and requires no knowledge about interaction rules. We employ this model to large-scale screenings on the entire protein domain database (Pfam). Strikingly, with 0.1 trillion tests executed, the majority of the inferred coevolving protein domains are functionally related, and the coevolving amino acid residues are spatially coupled. Moreover, many of the coevolving positions are located at functionally important sites of proteins/protein complexes, such as the subunit linkers of superoxide dismutase, the tRNA binding sites of ribosomes, the DNA binding region of RNA polymerase, and the active and ligand binding sites of various enzymes. The results suggest sequence coevolution manifests structural and functional constraints of proteins. The intricate relations between sequence coevolution and various selective constraints are worth pursuing at a deeper level

CiteSeerX

Crossref

Directory of Open Access Journals

PubMed Central

Recommended from our members

How to Get the Most out of Your Curation Effort

Author: Rzhetsky Andrey
Shatkay Hagit
Wilbur W. John
Publication venue
Publication date: 21/12/2023
Field of study

Large-scale annotation efforts typically involve several experts who may disagree with each other. We propose an approach for modeling disagreements among experts that allows providing each annotation with a confidence value (i.e., the posterior probability that it is correct). Our approach allows computing certainty-level for individual annotations, given annotator-specific parameters estimated from data. We developed two probabilistic models for performing this analysis, compared these models using computer simulation, and tested each model's actual performance, based on a large data set generated by human annotators specifically for this study. We show that even in the worst-case scenario, when all annotators disagree, our approach allows us to significantly increase the probability of choosing the correct annotation. Along with this publication we make publicly available a corpus of 10,000 sentences annotated according to several cardinal dimensions that we have introduced in earlier work. The 10,000 sentences were all 3-fold annotated by a group of eight experts, while a 1,000-sentence subset was further 5-fold annotated by five new experts. While the presented data represent a specialized curation task, our modeling approach is general; most data annotation studies could benefit from our methodology.</p

Knowledge UChicago

Recommended from our members